Abstract: Scene understanding includes many related sub-tasks, such as scene categorization, depth estimation, object
detection, etc. Each of these sub-tasks is often notoriously hard, and state-of-the-art classifiers already exist for many of them. 
These classifiers operate on the same raw image and provide correlated outputs. It is desirable to have an algorithm that can 
capture such correlation without requiring any changes to the inner workings of any classifier. 

We propose Feedback Enabled Cascaded Classification Models (FE-CCM), which jointly optimize all the sub-tasks while requiring
only a 'black-box' interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are 
repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method 
involves a feedback step that allows later classifiers to provide earlier classifiers information about which error modes to focus on. 
We show that our method significantly improves performance in all the sub-tasks in the domain of scene understanding, where 
we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency 
detection. Our method also improves performance in two robotic applications: an object-grasping robot and an object-finding 
robot. 

Index Terms — Scene understanding, Classification, Machine learning, Robotics. 
1 Introduction

One of the primary goals in computer vision is holistic
scene understanding, which involves many sub-tasks,
such as depth estimation, scene categorization, saliency
detection, object detection, event categorization, etc. (see
Figure 1). Each of these tasks explains some aspect of a
particular scene, and in order to fully understand a scene
we would need to solve each of these sub-tasks. Several
independent efforts have resulted in good classifiers for
each of these sub-tasks. In practice, we see that the
sub-tasks are coupled: for example, if we know that the scene
is an indoor scene, it would help us estimate depth from
that single image more accurately. As another example, in the
robotic grasping domain, if we know what kind of object
we are trying to grasp, then it is easier for a robot to figure
out how to pick it up. In this paper, we propose a unified
model that jointly optimizes for all the sub-tasks, allowing
them to share information and guiding the classifiers towards
a joint optimum. We show that this can be seamlessly applied
across different applications.

Recently, several approaches have tried to combine these
different classifiers for related tasks in vision [1]-[10];
however, most of them tend to be ad hoc (i.e., a hard-coded
rule is used) and often require intimate knowledge of
the inner workings of the individual classifiers.
Even beyond vision, in many other domains, state-of-the-art
classifiers already exist for many sub-tasks. However, these
carefully engineered models are often tricky to modify, or
even to simply re-implement from the available descriptions.
Heitz et al. [11] recently developed a framework for
scene understanding called Cascaded Classification Models
(CCM), treating each classifier as a 'black-box'. Each
classifier is repeatedly instantiated, with the next layer using
the outputs of the previous classifiers as inputs. While this
work proposed a method of combining the classifiers in a
way that increased the performance in all of the four tasks
they considered, it had the drawback that it optimized for
each task independently, and there was no way of feeding
back information from later classifiers to earlier classifiers
during training. This feedback can potentially help the
CCM achieve a better solution.

In our work, we propose Feedback Enabled Cascaded
Classification Models (FE-CCM), which provide feedback
from the later classifiers to the earlier ones during
the training phase. This feedback gives earlier stages
information about what error modes should be focused on,
or what can be ignored without hurting the performance
of the later classifiers. For example, misclassifying a street 
scene as highway may not hurt as much as misclassifying a 
street scene as open country. Therefore we prefer the first 
layer classifier to focus on fixing the latter error instead 
of optimizing the training accuracy. In another example, 
allowing the depth estimation to focus on some specific 
regions can help perform better scene categorization. For 
instance, the open country scene is characterized by its 
upper part as a wide sky area. Therefore, estimating the
depth well in that region, at the expense of some regions in the
bottom, may help to correctly classify an image. In detail,
we do so by jointly optimizing all the tasks; the outputs of
the first layers are treated as latent variables and training
is done using an iterative algorithm. Another benefit of our
method is that each of the classifiers can be trained using
its own independent training dataset, i.e., our model does
not require a datapoint to have labels for all the sub-tasks,
and hence it scales well with heterogeneous datasets.

Fig. 1. Given a test image, Holistic Scene Understanding
corresponds to inferring the labels for all possible scene understanding
dimensions. In our work, we infer labels corresponding to scene
categorization, event categorization, depth estimation (black = close,
white = far), object detection, geometric layout (green = vertical, red
= horizontal, blue = vertical) and saliency detection (cyan = salient),
as shown above, and achieve this jointly using one unified model.
Note that different tasks help each other: for example, the depth
estimate of the scene can help the object detector look for the horse;
the object detection can help perform better saliency detection, etc.

In our approach, we treat each classifier as a 'black-box', 
with no restrictions on its operation other than requiring the 
ability to train on data and have an input/output interface. 
(Often each of these individual classifiers could be quite
complex, e.g., producing labelings over pixels in an entire
image.) Therefore, our method is applicable to many other
tasks that have different but correlated outputs. 

In extensive experiments, we show that our method 
achieves significant improvements in the performance of 
all the six sub-tasks we consider: depth estimation, object 
detection, scene categorization, event categorization, geo- 
metric labeling and saliency detection. We also successfully 
apply the same model to two robotics applications: robotic 
grasping, and robotic object detection. 

The rest of the paper is organized as follows. We first
define holistic scene understanding and discuss the related
work in Section 2. We describe our FE-CCM method
in Section 3, followed by a discussion of handling
heterogeneous datasets in Section 4. We provide the
implementation details of the classifiers in Section 5. We present
the experiments and results in Section 6 and some robotic
applications in Section 7. We finally conclude in Section 8.

2 Overview of Scene Understanding 

2.1 Holistic Scene Understanding 

When we look at an image of a scene, such as in Figure 1,
we are often interested in answering several different
questions: What objects are there in the image? How far are
things? What is going on in the scene? What type of scene
is it? And so on. These are only a few examples of questions
in the area of scene understanding, and there may be
many more.



In the past, the focus has been to address each task in
isolation, where the goal of each task is to produce a label
$Y_i \in S_i$ for the $i$-th sub-task. If we are considering depth
estimation (see Figure 1), then the label would be
$Y_1 \in S_1 = \mathbb{R}^{100 \times 100}$ for continuous values of depth in a 100 x
100 output. For scene categorization, we will have
$Y_2 \in S_2 = \{1, \ldots, K\}$ for $K$ scene classes. If we have $n$
sub-tasks, then we would have to produce an output as:

$$\mathcal{Y} = \{Y_1, \ldots, Y_n\} \in S_1 \times S_2 \times \cdots \times S_n.$$

The interesting part here is that often we want to solve 
different combinations of the sub-tasks depending on the 
situation. The goal of this work is to design an algorithm 
that does not depend on the particular sub-tasks in question. 

2.2 Related Work 

Cascaded classifiers. Using information from related tasks
to improve the performance of the task in question has
been studied in various fields of machine learning. The idea
of cascading layers of classifiers to aid a task was first
introduced with neural networks as multi-level perceptrons,
where the output of the first layer of perceptrons is passed
on as input to the next layer [12]-[14]. However, it is often
hard to train neural networks and to gain insight into their
operation, making them hard to apply to complicated tasks.

The idea of improving classification performance by
combining outputs of many classifiers is used in methods
such as Boosting [15], where many weak learners are
combined to obtain a more accurate classifier; this has
been applied to tasks such as face detection [16], [17]. To
incorporate contextual information, Fink and Perona [18]
exploited local dependencies between objects in a boosting
framework, but did not allow for multiple rounds of
communication between objects. Torralba et al. [19] introduced
Boosted Random Fields to model object dependency,
which used boosting to learn the graph structure and local
evidence of a conditional random field. Tu [20] proposed a
more general framework which used pixel-level label maps
to learn a contextual model through a cascaded classifier
approach. All these works mainly consider the interactions
between labels of the same type. However, in our CCM
framework [21], [22], the focus is on capturing contextual
interactions between labels of different types. Furthermore,
compared to the feed-forward only cascade method in
[20], our model with feedback not only iteratively refines
the contextual interactions, but also refines the individual
classifiers to provide helpful context.

Sensor fusion. There has been a huge body of work in 
the area of sensor fusion where classifiers output the same 
labels but work with different modalities, each one giving 
additional information and thus improving the performance, 
e.g., in biometrics, data from voice recognition and face 
recognition is combined [23]. However, in our scenario,
we consider multiple tasks where each classifier is tackling 
a different problem (i.e., predicting different labels), with 
the same input being provided to all the classifiers. 

Structured Models for combining tasks. While the methods
discussed above combine classifiers to predict the same
labels, there is a group of works that designs models
for predicting heterogeneous labels. Kumar and Hebert [1]
developed a large MRF-based probabilistic model to link
multi-class segmentation and object detection. Li et al. [24]
modeled multiple interactions within tasks and across tasks
by defining an MRF over parameters. Similar efforts have
been made in the field of natural language processing.
Sutton and McCallum [6] combined a parsing model with
a semantic role labeling model into a unified probabilistic
framework that solved both simultaneously. Ando and
Zhang [25] proposed a general framework for learning
predictive functional structures from multiple tasks. All
these models require knowledge of the inner workings of
the individual classifiers, which makes it hard to fit existing
state-of-the-art classifiers of certain tasks into the models.

Structured learning algorithms (e.g., [26]-[28]) can also be
a viable option for the setting of combining multiple tasks.
There has been recent development in structured learning
on handling latent variables (e.g., hidden conditional random
field [29], latent structured SVM [30]), which can potentially
be applied to multi-task settings with disjoint datasets.
With considerable understanding of each of the tasks, the
loss function in structured learning provides a nice way to
leverage different tasks. However, in this work, we focus
on developing a more generic algorithm that can be easily
applied even without intimate knowledge of the tasks.

There have been many works which show that with a
well-designed model, one can improve the performance of
a particular task by using cues from other tasks (e.g., [7]-[9]).
Saxena et al. manually designed the terms in an MRF
to combine depth estimation with object detection [2] and
stereo cues [10]. Sudderth et al. [5] used object recognition
to help 3D structure estimation.

Context. There is a large body of work that leverages
contextual information to help specific tasks. Various sources of
context have been explored, ranging from the global scene
layout, interactions between objects and regions to local
features. To incorporate scene-level information, Torralba et
al. [31], [32] used the statistics of low-level features across
the entire scene to prime object detection or help depth
estimation. Hoiem et al. [33] used 3D scene information to
provide priors on potential object locations. Park et al. [34]
used the ground plane estimation as contextual information
for pedestrian detection. Many works also model context to
capture the local interactions between neighboring regions
[35]-[37], objects [38]-[42], or both [43]-[45]. These methods
improve the performance of some specific tasks by
combining information from different aspects. However, most
of these methods cannot be applied to cases when we only
have "black-box" classifiers for the individual tasks.

Holistic Scene Understanding. Hoiem et al. [3] proposed
an innovative but ad hoc system that combined boundary
detection and surface labeling by sharing some low-level
information between the classifiers. Li et al. [4], [46]
combined image classification, annotation and segmentation
with a hierarchical graphical model. However, these
methods required considerable attention to each classifier,
and considerable insight into the inner workings of each
task and also the connections between them. This limits
the generality of the approaches in introducing new tasks
easily or being applied to other domains.

Fig. 2. The proposed feedback enabled cascaded classification
model (FE-CCM) for combining related classifiers.
($\forall i \in \{1, 2, \ldots, n\}$: $\Psi_i(X)$ = features corresponding to
Classifier$_i$ extracted from image $X$, $Z_i$ = output of Classifier$_i$ in the first
stage parameterized by $\theta_i$, $Y_i$ = output of Classifier$_i$ in the
second stage parameterized by $\omega_i$.) In the proposed FE-CCM model,
there is feedback from the latter stages to help achieve a model
which optimizes all the tasks considered, jointly. Here the Classifier$_i$'s
on the two layers can have different forms though they are for the
same classification task. (Note that different colors of lines are used
only to make the figure more readable.)

Deep Learning. There is also a large body of work in the
area of deep learning, and we refer the reader to Bengio
and LeCun [47] for a nice overview of deep learning
architectures and to Caruana [48] for multitask learning with shared
representation. While efficient back-propagation methods
like [49] have been commonly used in learning a multi-layer
network, they are not as easy to apply to our case, where
each node is a complex classifier. Most works in deep
learning (e.g., [50]-[52]) differ from our work in that they
focus on one particular task (same labels) by building
different classifier architectures, as compared to our setting
of different tasks with different labels. Hinton et al. [51]
used unsupervised learning to obtain an initial configuration
of the parameters. This provides a good initialization and
hence their multi-layered architecture does not suffer from
local minima during optimization. At a high level, we
can also look at our work as a multi-layered architecture
(where each node typically produces complex outputs, e.g.,
labels over the pixels in the image); and initialization in
our case comes from existing state-of-the-art individual
classifiers. Given this initialization, our training procedure
finds parameters that (consistently) improve performance
across all the sub-tasks.

3 Feedback Enabled Cascaded Classification Models

In the field of scene understanding, a lot of independent
research into each of the vision sub-tasks has led to excellent
classifiers. These independent classifiers are typically
trained on different or heterogeneous datasets due to the
lack of ground-truth labels for all the sub-tasks. In addition,
each of these classifiers comes with its own learning and
inference methods. Our goal is to consider each of them as
a 'black-box', which makes it easy to combine them. We
describe what we mean by 'black-box classifiers' below.

Black-box Classifier. A black-box classifier, as the name
suggests, is a classifier for which operations (such as
learning and inference algorithms) are available for use,
but whose inner workings are not known. We assume that,
given the training dataset $\mathbf{X}$, the extracted features $\Psi_i(\mathbf{X})$
and the target outputs $\mathbf{Y}_i$ of the $i$-th task, the black-box
classifier has some internal learning function $f^i_{\text{learn}}$ with
parameters $\theta_i$ that optimizes the mapping from the inputs
to the outputs for the training data (footnote 1). Once the parameters
have been learnt, given a new data point $X$ with features
$\Psi_i(X) \in \mathbb{R}^K$, where $K$ can be changed as desired (footnote 2), the
black-box classifier returns the output $\hat{Y}_i$ according to its
internal inference function $f^i_{\text{infer}}$. This is illustrated through
the equations below. For the $i$-th task,

Learning: $$\theta_i^* = \operatorname*{optimize}_{\theta_i} f^i_{\text{learn}}(\Psi_i(\mathbf{X}), \mathbf{Y}_i; \theta_i) \tag{1}$$

Inference: $$\hat{Y}_i = \operatorname*{optimize}_{Y_i} f^i_{\text{infer}}(\Psi_i(X), Y_i; \theta_i^*). \tag{2}$$

This approach of treating each classifier as a black-box
allows us to use different existing classifiers which have been
known to perform well at specific sub-tasks. Furthermore,
without changing their inner workings, it allows us to
compose them into one model which exploits the information
from each sub-task to aid holistic scene understanding.
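The black-box assumption can be made concrete with a small interface sketch. The code below is only an illustration under stated assumptions: the names `BlackBoxClassifier`, `learn`, `infer` and `RidgeScorer` are hypothetical and do not come from the paper, and ridge regression merely stands in for an arbitrary existing classifier whose internals the cascade never inspects.

```python
from typing import Protocol

import numpy as np


class BlackBoxClassifier(Protocol):
    """Hypothetical interface: the cascade only ever calls these two methods."""

    def learn(self, features: np.ndarray, targets: np.ndarray) -> None: ...

    def infer(self, features: np.ndarray) -> np.ndarray: ...


class RidgeScorer:
    """Stand-in black box (an assumption): ridge regression with continuous scores."""

    def __init__(self, reg: float = 1e-3) -> None:
        self.reg = reg
        self.weights = None

    def learn(self, features: np.ndarray, targets: np.ndarray) -> None:
        # Plays the role of Equation 1: fit parameters on (features, targets).
        dim = features.shape[1]
        gram = features.T @ features + self.reg * np.eye(dim)
        self.weights = np.linalg.solve(gram, features.T @ targets)

    def infer(self, features: np.ndarray) -> np.ndarray:
        # Plays the role of Equation 2: map features to outputs with fixed parameters.
        return features @ self.weights
```

Any classifier that can be wrapped behind these two calls, and whose input dimension can be enlarged, could in principle be placed in either layer of the cascade.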

3.1 Our Model 

Our model is built in the form of a two-layer cascade,
as shown in Figure 2. The first layer consists of an
instantiation of each of the black-box classifiers with the
image features as input. The second layer is a repeated
instantiation of each of the classifiers with the first-layer
classifier outputs as well as the image features as inputs.
Note that the repeated classifier on the second layer
need not have the same mathematical form as
the one on the first layer. Instead, we consider it a
repeated instantiation only because both are used for the
same classification task.

Notation: We consider $n$ related sub-tasks Classifier$_i$,
$i \in \{1, 2, \ldots, n\}$ (Figure 2). We describe the notation used in
this paper as follows:

$\Psi_i(X)$: Features corresponding to Classifier$_i$ extracted from
image $X$.

$Z_i$, $\mathcal{Z}$: $Z_i$ indicates the output from the first-layer Classifier$_i$.
Many classifiers output continuous scores instead of
labels. Where this is not the case, it is trivial
to convert a binary classifier's output to a log-odds
score. For a $K$-class ($K > 2$) classifier, we consider
the output to be a $K$-dimensional vector. $\mathcal{Z}$ indicates
the set $\{Z_1, Z_2, \ldots, Z_n\}$.

$\theta_i$, $\Theta$: $\theta_i$ indicates the parameters corresponding to the first-layer
Classifier$_i$. $\Theta$ indicates the set $\{\theta_1, \ldots, \theta_n\}$.

$Y_j$, $\mathcal{Y}$: $Y_j$ indicates the output for the $j$-th task in the second
layer, using the original features $\Psi_j(X)$ as well
as all the outputs from the first layer as input. $\mathcal{Y}$
indicates the set $\{Y_1, Y_2, \ldots, Y_n\}$.

$\omega_j$, $\Omega$: $\omega_j$ indicates the parameters for the second-layer
Classifier$_j$. $\Omega$ indicates the set $\{\omega_1, \ldots, \omega_n\}$.

$\Gamma_j$, $\Gamma$: $\Gamma_j$ is the dataset for the $j$-th task, which consists of labeled
pairs $\{X, Y_j\}$ in the training set. $\Gamma$ represents all the
labeled data.

$f^i_{\text{infer}}$, $f^i_{\text{learn}}$: the internal inference function and learning function
for the $i$-th classifier on the first layer.

${f'}^{i}_{\text{infer}}$, ${f'}^{i}_{\text{learn}}$: the internal inference function and learning function
for the $i$-th classifier on the second layer.
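The conversion of a binary classifier's output to a log-odds score, mentioned in the notation for $Z_i$, can be sketched as follows. This is a minimal illustration, assuming the classifier reports either a probability or a hard 0/1 label; the function name `to_log_odds` and the clipping constant `eps` are hypothetical choices, not taken from the paper.

```python
import numpy as np


def to_log_odds(prob, eps=1e-6):
    """Map probabilities (or hard 0/1 labels) to continuous log-odds scores.

    Clipping by eps (an assumed constant) keeps hard labels finite.
    """
    p = np.clip(prob, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)
```

With this convention a hard label of 0 or 1 maps to a large negative or positive score rather than to infinity, so it can be passed to a second-layer classifier like any other continuous feature.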

1. Unless specified, the regular symbols (e.g. $X$, $Y_i$, etc.) are used for
a particular data-point, and the bold-face symbols (e.g. $\mathbf{X}$, $\mathbf{Y}_i$, etc.) are
used for a dataset.

2. If the input dimension of the black-box classifier cannot be changed,
then we will use that black-box in the first layer only.



With the notations in place we will now first describe 
the inference and learning algorithms for the proposed 
model in the following sections, followed by probabilistic 
interpretation of our method. 

3.2 Inference Algorithm 

During inference, the inputs $\Psi_i(X)$ are given and our goal
is to infer the final outputs $Y_i$. Using the learned parameters
$\theta_i$ for the first level of classifiers and $\omega_i$ for the second
level of classifiers, we first infer the first-layer outputs $Z_i$
and then infer the second-layer outputs $Y_i$. More formally,
we perform the following.

$$\hat{Z}_i = \operatorname*{optimize}_{Z_i} f^i_{\text{infer}}(\Psi_i(X), Z_i; \theta_i) \tag{3}$$

$$\hat{Y}_i = \operatorname*{optimize}_{Y_i} {f'}^{i}_{\text{infer}}([\Psi_i(X)\ \hat{\mathcal{Z}}], Y_i; \omega_i) \tag{4}$$

The inference algorithm is given in Algorithm 1. This
method allows us to use the internal inference function
(Equation 2) of the black-box classifiers without knowing
their inner workings. Note that the complexity here is no
more than a constant times the complexity of inference in
the original classifiers.

Algorithm 1 Inference

1. Inference for first layer:

for i = 1 : n

Infer the outputs of the i-th classifier using Equation 3
end

2. Inference for second layer:

for i = 1 : n

Infer the outputs of the i-th classifier using Equation 4
end
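The two passes of the inference procedure can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: `cascade_infer` is a hypothetical name, each black-box inference function is represented by a plain Python callable, and all first-layer outputs are simply concatenated column-wise before being appended to each task's own features.

```python
import numpy as np


def cascade_infer(features, first_layer, second_layer):
    """Two-layer feed-forward inference.

    features[i]     : (m, d_i) array of features for task i
    first_layer[i]  : callable, features -> first-layer outputs (m, k_i)
    second_layer[i] : callable, [features, all first-layer outputs] -> outputs
    """
    # Pass 1: every first-layer classifier sees only its own features.
    z = [infer(psi) for infer, psi in zip(first_layer, features)]
    z_all = np.hstack(z)
    # Pass 2: every second-layer classifier sees its own features plus
    # the first-layer outputs of all tasks.
    y = [infer(np.hstack([psi, z_all]))
         for infer, psi in zip(second_layer, features)]
    return z, y
```

Each black box is called exactly once per layer, which is consistent with the remark that the cost is a constant multiple of the original inference cost.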



3.3 Learning Algorithm 

During the training stage, the inputs $\Psi_i(X)$ as well as
the target outputs $Y_1, Y_2, \ldots, Y_n$ of the second level of
classifiers are all observed (because the ground-truth labels
are available). In our algorithm, we consider $\mathcal{Z}$ (outputs
of layer 1 and inputs to layer 2) as hidden variables. In
previous work, Heitz et al. [11] assume that each layer is
independent and that each layer produces the best output
independently (without consideration for other layers), and
therefore use the ground-truth labels for $\mathcal{Z}$ even for training
the classifiers in the first layer.

On the other hand, we want to optimize for the final
outputs as much as possible. Thus the first-layer classifiers
need not perform their best (w.r.t. ground truth), but rather
focus on error modes that would result in the second
layer's output $(Y_1, Y_2, \ldots, Y_n)$ being more correct. Therefore,
we learn the model through an iterative Expectation-Maximization
formulation, given the independencies between
classifiers represented by the model in Figure 2. In
one step (feed-forward step) we assume the variables $Z_i$'s
are known and learn the parameters, and in the other step
(feed-back step) we fix the parameters estimated previously
and estimate the variables $Z_i$'s. Since the $Z_i$'s are not fixed
to the ground truth, as the iterations progress, the first level
of classifiers starts focusing on the error modes which would
give the best improvement in performance at the end of
the second level of classifiers. The learning algorithm is
summarized in Algorithm 2.
Algorithm 2 Learning

1. Initialize latent variables $\mathcal{Z}$ with the ground-truth $\mathcal{Y}$.

2. Do until convergence or maximum iteration: {

Feed-forward step: Fix latent variables $\mathcal{Z}$, estimate the parameters $\Theta$
and $\Omega$ using Equation 5 and Equation 6.

Feed-back step: Fix the parameters $\Theta$ and $\Omega$, compute latent variables
$\mathcal{Z}$ using Equation 7.

}

Initialization: We initialize the model by setting the latent
variables $Z_i$'s to the ground truth, i.e. $Z_i = Z_i^{gt}$. Training
with this initialization, our cascade is equivalent to the CCM in
[11], where the classifiers (and the parameters) in the first
layer are similar to the original state-of-the-art classifier
and the classifiers in the second layer use the outputs of
the first layer in addition to the original features as input.

Feed-forward Step: In this step, we estimate the parameters
$\Theta$ and $\Omega$. We assume that the latent variables $Z_i$'s are
known (and $Y_i$'s are known because they are the ground-truth
during learning, i.e. $Y_i = Y_i^{gt}$). We then learn the
parameters of each classifier independently. Learning $\theta_i$ is
precisely the learning problem of the 'black-box classifier',
and learning $\omega_i$ is also an instantiation of the original learning
problem, but with the original input features appended
with the outputs of the first-level classifiers. Therefore, we
can use the learning method provided by the individual
black-box classifier (Equation 1).

$$\hat{\theta}_i = \operatorname*{optimize}_{\theta_i} f^i_{\text{learn}}(\Psi_i(\mathbf{X}), \mathbf{Z}_i; \theta_i) \tag{5}$$

$$\hat{\omega}_i = \operatorname*{optimize}_{\omega_i} {f'}^{i}_{\text{learn}}([\Psi_i(\mathbf{X})\ \mathcal{Z}], \mathbf{Y}_i; \omega_i) \tag{6}$$

We now have the parameters for all the classifiers.

Feed-back Step: In the second step, we estimate the
values of the variables $Z_i$'s assuming that the parameters
are fixed (and $Y_i$'s are given because the ground-truth is
available, i.e. $Y_i = Y_i^{gt}$). This feed-back step is the crux
that tells the first-layer classifiers which
error modes should be focused on and which can be ignored
without hurting the final performance. Given that the $\theta_i$'s and $\omega_i$'s
are fixed, we want the $Z_i$'s to be good predictions from the
first-layer classifiers and also to help increase the number of correct
predictions of $Y_i$'s as much as possible. We optimize the
following function for the feed-back step:

$$\operatorname*{optimize}_{\mathcal{Z}} \sum_{i=1}^{n} \left( J_1^i(\Psi_i(X), Z_i; \theta_i) + J_2^i(\Psi_i(X), \mathcal{Z}, Y_i; \omega_i) \right) \tag{7}$$

where $J_1^i$'s and $J_2^i$'s are functions respectively
related to the first-layer classifiers and the
second-layer classifiers. One option is to have
$J_1^i(\Psi_i(X), Z_i; \theta_i) = f^i_{\text{infer}}(\Psi_i(X), Z_i; \theta_i)$ and
$J_2^i(\Psi_i(X), \mathcal{Z}, Y_i; \omega_i) = {f'}^{i}_{\text{infer}}([\Psi_i(X)\ \mathcal{Z}], Y_i; \omega_i)$ if
the intrinsic inference functions for the classifiers are
known. Section 3.4 discusses the case where
the intrinsic functions are unknown. The updated $Z_i$'s
will be used to re-learn the classifier parameters in the
feed-forward step of the next iteration. Note that the updated
$Z_i$'s have continuous values. If the internal learning
function of a classifier accepts only labels, we threshold
the values of $Z_i$'s to get labels.
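The alternation between the two steps can be sketched as a short training loop. The code below is a skeleton under simplifying assumptions rather than the authors' implementation: every task is assumed to be labeled on the same data points, ridge regression (`fit_ridge`) stands in for each black-box learning function, the feed-back rule is passed in as a hypothetical callable `feedback`, and a fixed iteration count replaces the convergence check.

```python
import numpy as np


def fit_ridge(inputs, targets, reg=1e-3):
    """Stand-in for a black-box learning function (an assumption)."""
    gram = inputs.T @ inputs + reg * np.eye(inputs.shape[1])
    return np.linalg.solve(gram, inputs.T @ targets)


def train_fe_ccm(features, labels, feedback, n_iters=5):
    """features[i]: (m, d_i) array; labels[i]: (m, k_i) ground truth for task i.

    feedback(features, labels, z, theta, omega) must return the new list of
    latent first-layer targets, one (m, k_i) array per task.
    """
    # Initialization: latent variables start at the ground truth.
    z = [y.copy() for y in labels]
    for _ in range(n_iters):
        # Feed-forward step: latent variables fixed, each classifier is
        # re-learned independently of the others.
        theta = [fit_ridge(psi, zi) for psi, zi in zip(features, z)]
        z_all = np.hstack(z)
        omega = [fit_ridge(np.hstack([psi, z_all]), y)
                 for psi, y in zip(features, labels)]
        # Feed-back step: parameters fixed, latent variables re-estimated.
        z = feedback(features, labels, z, theta, omega)
    return theta, omega, z
```

With a single iteration the feed-forward step is trained on the ground truth, which matches the description of the initialization as equivalent to the feed-forward-only cascade; the choice of `feedback` is what distinguishes the variants discussed in Section 3.4.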



3.4 Probabilistic Interpretation 

Our algorithm can be explained with a probabilistic
interpretation where the goal is to maximize the log-likelihood
of the outputs of all tasks given the observed inputs, i.e.,
$\log P(\mathcal{Y}|X)$, where $X$ is an image belonging to the training
set $\Gamma$. Therefore, the goal of the proposed model shown in
Figure 2 is to maximize

$$\log \prod_{X \in \Gamma} P(\mathcal{Y}|X; \Theta, \Omega) \tag{8}$$

To introduce the hidden variables $Z_i$'s, we expand Equation 8
as follows, using the independencies represented by
the directed model in Figure 2:

$$= \sum_{X \in \Gamma} \log \sum_{\mathcal{Z}} P(Y_1, \ldots, Y_n, \mathcal{Z}|X; \Theta, \Omega) \tag{9}$$

$$= \sum_{X \in \Gamma} \log \sum_{\mathcal{Z}} \prod_{i=1}^{n} P(Y_i|\Psi_i(X), \mathcal{Z}; \omega_i)\, P(Z_i|\Psi_i(X); \theta_i) \tag{10}$$

However, the summation inside the log makes it difficult
to learn the parameters. Motivated by the Expectation
Maximization algorithm [53], we iterate between the two
steps as described in the following. Again we initialize the
classifiers by learning the classifiers with ground-truth, as
discussed in Section 3.3.



Feed-forward Step: In this step, we estimate the parameters
by assuming that the latent variables $Z_i$'s are known
(and $Y_i$'s are known anyway because they are the ground-truth).
This results in

$$\operatorname*{maximize}_{\theta_1, \ldots, \theta_n, \omega_1, \ldots, \omega_n} \sum_{X \in \Gamma} \sum_{i=1}^{n} \left( \log P(Y_i|\Psi_i(X), \mathcal{Z}; \omega_i) + \log P(Z_i|\Psi_i(X); \theta_i) \right) \tag{11}$$

Now in this feed-forward step, the terms for maximizing
the different parameters turn out to be independent. So, for
the $i$-th classifier we have:

$$\operatorname*{maximize}_{\theta_i} \sum_{X \in \Gamma} \log P(Z_i|\Psi_i(X); \theta_i) \tag{12}$$

$$\operatorname*{maximize}_{\omega_i} \sum_{X \in \Gamma} \log P(Y_i|\Psi_i(X), \mathcal{Z}; \omega_i) \tag{13}$$

Note that the optimization problem nicely breaks down
into the sub-problems of training the individual classifier
for the respective sub-tasks. We can solve each sub-problem
separately given the probabilistic interpretation of the
corresponding classifier. When the classifier is taken as a
'black-box', this can be approximated using the original learning
method provided by the individual black-box classifier
(Equation 5 and Equation 6).

Feed-back Step: In this step, we estimate the values
of the latent variables $Z_i$'s assuming that the parameters
are fixed. We perform MAP inference on $Z_i$'s (and not
marginalization). This can be considered as a special variant
of the general EM framework (hard EM [54]). Using
Equation 10, we get the following optimization problem:

$$\operatorname*{maximize}_{\mathcal{Z}} \log P(Y_1, \ldots, Y_n, \mathcal{Z}|X; \theta_1, \ldots, \theta_n, \omega_1, \ldots, \omega_n)$$

$$\Leftrightarrow \operatorname*{maximize}_{\mathcal{Z}} \sum_{i=1}^{n} \left( \log P(Y_i|\Psi_i(X), \mathcal{Z}; \omega_i) + \log P(Z_i|\Psi_i(X); \theta_i) \right) \tag{14}$$

This maximization problem requires that we have access to
the characterization of the individual black-box classifiers
in a probabilistic form. If the probabilistic interpretations
of the classifiers are known, we can solve the above
function accordingly. Note that Equation 14 is the same as
Equation 7 with $J_1^i(\Psi_i(X), Z_i; \theta_i) = \log P(Z_i|\Psi_i(X); \theta_i)$
and $J_2^i(\Psi_i(X), \mathcal{Z}, Y_i; \omega_i) = \log P(Y_i|\Psi_i(X), \mathcal{Z}; \omega_i)$.

In some cases, the negative classifier log-likelihoods in Equation 14
actually turn out to be convex. For example, if the
individual classifiers are linear or logistic classifiers, the
equivalent minimization problem is convex and can be solved using
gradient descent (or any such method).

However, if the probabilistic interpretations of the classifiers
are unknown, the feedback step requires extra modeling.
Some modeling options are provided as follows:

• Case 1: Insight into the vision problem is available. In
this case, one could use domain knowledge of the
task to properly model the $J_1^i$'s and $J_2^i$'s.

• Case 2: No insight into the vision problem is available
and no internal function of the original classifier is
known. In this case, we formulate the $J_1^i$'s and $J_2^i$'s
as follows. $J_1^i$ is defined to be a distance function
between the target $Z_i$ and the estimated $\hat{Z}_i$, which serves
as a regularization for the first-layer classifiers.

$$J_1^i(\Psi_i(X), Z_i; \theta_i) = \|Z_i - \hat{Z}_i\|_2^2 \quad \text{s.t.}\ \hat{Z}_i = \operatorname*{optimize}_{Z_i} f^i_{\text{infer}}(\Psi_i(X), Z_i; \theta_i) \tag{15}$$

To formulate the $J_2^i$'s, we make a variational approximation
on the output of the second-layer classifier for task $i$ (i.e.,
approximating it as a Gaussian [55]) to get:

$$\operatorname*{minimize}_{\alpha_i} \sum_{X \in \Gamma} \|\hat{Y}_i - \alpha_i^T [\Psi_i(X)\ \mathcal{Z}]\|_2^2 \tag{16}$$

where $\alpha_i$ are parameters of the approximation model and $\hat{Y}_i$
is the actual output of the second-layer classifier for
task $i$, i.e. $\hat{Y}_i = \operatorname*{optimize}_{Y_i} {f'}^{i}_{\text{infer}}([\Psi_i(X)\ \hat{\mathcal{Z}}], Y_i; \omega_i)$.
Then we define the $J_2^i$'s as follows.

$$J_2^i(\Psi_i(X), \mathcal{Z}, Y_i; \omega_i) = \|Y_i - \alpha_i^T [\Psi_i(X)\ \mathcal{Z}]\|_2^2 \tag{17}$$



Sparsity: Note that the parameter $\alpha_i$ is typically
extremely high-dimensional (and grows with the number
of tasks) because the second-layer classifiers take as input
the original features as well as the outputs of all previous
layers. The learning for the approximation model may
become ill-conditioned. Therefore, we want our model
to select only a few non-zero weights, i.e., only a few
non-zero entries in $\alpha_i$. We do this by introducing
$\ell_1$ sparsity in the parameters [56]. So Equation 16 is
extended as follows.

$$\operatorname*{minimize}_{\alpha_i} \sum_{X \in \Gamma} \left( \|\hat{Y}_i - \alpha_i^T [\Psi_i(X)\ \mathcal{Z}]\|_2^2 + \beta |\alpha_i|_1 \right) \tag{18}$$
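For Case 2, both the sparse linear approximation and the resulting feed-back update are easy to sketch when every task has a scalar output. The code below is an illustration under explicit assumptions and is not the authors' solver: the approximation weights are fitted with ISTA (a standard proximal-gradient method chosen here for brevity), the penalty weight `lam` is a hypothetical single constant, and each weight vector is assumed to split into a feature part and a part `b` acting on the first-layer outputs. Under a squared-error choice for both terms, the feed-back objective for one data point is $\|z - \hat{z}\|^2 + \|r - Bz\|^2$ with $r_i = y_i - a_i^T \Psi_i(X)$, and its minimizer solves $(I + B^T B) z = \hat{z} + B^T r$.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_sparse_alpha(inputs, targets, lam=0.1, n_steps=500):
    """ISTA for min_a ||inputs @ a - targets||^2 + lam * |a|_1.

    inputs: (m, d) rows of [features, first-layer outputs]; targets: (m,).
    """
    lipschitz = 2.0 * np.linalg.norm(inputs, 2) ** 2  # for the smooth part
    step = 1.0 / lipschitz
    alpha = np.zeros(inputs.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * inputs.T @ (inputs @ alpha - targets)
        alpha = soft_threshold(alpha - step * grad, step * lam)
    return alpha


def feedback_z(z_hat, y, psi_terms, b):
    """Closed-form feed-back update for one data point (scalar outputs).

    z_hat: (n,) first-layer predictions; y: (n,) ground-truth outputs;
    psi_terms[i]: contribution of task i's own features to its approximation;
    b: (n, n) matrix whose row i holds task i's weights on the latent outputs.
    """
    n = z_hat.shape[0]
    residual = y - psi_terms
    # Setting the gradient of ||z - z_hat||^2 + ||residual - b z||^2 to zero.
    return np.linalg.solve(np.eye(n) + b.T @ b, z_hat + b.T @ residual)
```

When `b` is zero the update returns the first-layer predictions unchanged, and as the second layer relies more heavily on the latent outputs the targets are pulled towards values that explain the ground-truth labels.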

Inference: As introduced in Section 3.2, our inference
procedure consists of two steps: first maximize over the hidden
variable Z, and then maximize over Y.

$$\hat{Z} = \operatorname*{argmax}_{Z} \log P(Z \mid X, \Theta) \qquad (19)$$

$$\hat{Y} = \operatorname*{argmax}_{Y} \log P(Y \mid \hat{Z}, X, \Omega) \qquad (20)$$



3. Another alternative would have been to maximize $P(Y \mid X) =
\sum_Z P(Y, Z \mid X)$; however, this would require marginalization over the
variable Z, which is expensive to compute.



Given the structure of our directed graph, the outputs
of different classifiers on the same layer are independent
given their inputs and parameters. Therefore, Equations 19
and 20 are equivalent to the following:

$$\hat{Z}_i = \operatorname*{argmax}_{Z_i} \log P(Z_i \mid \Psi_i(X); \theta_i), \quad i = 1, \ldots, n \qquad (21)$$

$$\hat{Y}_i = \operatorname*{argmax}_{Y_i} \log P(Y_i \mid \Psi_i(X), \hat{Z}; \omega_i), \quad i = 1, \ldots, n \qquad (22)$$

As we see, Equation 21 and Equation 22 are instantiations
of Equation 3 and Equation 4 in the probabilistic form.
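The two-step inference procedure can be sketched as follows, treating every classifier as a black-box callable. This is a minimal illustration under assumed interfaces: `infer_two_layer` and the toy lambda classifiers are hypothetical and stand in for whatever inference routine each original classifier provides.

```python
def infer_two_layer(features, first_layer, second_layer):
    """features[i] plays the role of Psi_i(X); classifiers are black-box callables."""
    # Step 1: each first-layer classifier sees only its own features.
    z_hat = [clf(psi) for clf, psi in zip(first_layer, features)]
    # Step 2: each second-layer classifier sees its own features
    # together with the outputs of all first-layer classifiers.
    y_hat = [clf(psi, z_hat) for clf, psi in zip(second_layer, features)]
    return z_hat, y_hat


# Toy stand-ins for two tasks (purely illustrative).
first_layer = [lambda psi: sum(psi), lambda psi: max(psi)]
second_layer = [
    lambda psi, z: sum(psi) + 0.5 * z[1],
    lambda psi, z: max(psi) + 0.5 * z[0],
]
features = [[1.0, 2.0], [3.0, 4.0]]
z_hat, y_hat = infer_two_layer(features, first_layer, second_layer)
```

Because classifiers on the same layer are independent given their inputs, each list comprehension could also be evaluated in parallel.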

4 Training with Heterogeneous Datasets

Often real datasets are disjoint for different tasks, i.e., each
datapoint does not have labels for all the tasks. Our
formulation handles this scenario well. In this section, we
show our formulation for this general case, where we use
$\Gamma_i$ to denote the dataset that has labels only for the i-th task.

In the following, we provide the modifications to the
feed-forward step and the feed-back step for dealing with
disjoint datasets, i.e., data in dataset $\Gamma_i$ only have labels for
the i-th task. These modifications also allow us to develop
different variants of the model, described in Section 4.1.



Feed-forward Step: Using the feedback step, we can obtain
$Z_i$'s for all the data. Therefore, we use all the datasets
to re-learn each of the first-layer classifiers. If
the internal learning function of the black-box classifier is
additive over the data points, then we have

$$\theta_i = \operatorname*{optimize}_{\theta_i} \sum_{j} \pi_j \sum_{X \in \Gamma_j} f^i_{\text{learn}}(\Psi_i(X), Z_i; \theta_i) \qquad (23)$$

where the $\pi_j$'s are the importance factors given to the different
datasets and satisfy $\sum_j \pi_j = 1$. (See Section 4.1 on how
to choose the $\pi_j$'s.)

If the internal learning function is not additive over the
data points, we provide an alternative solution. We
sample a subset of data $X^j$ from each dataset $\Gamma_j$, i.e., $X^j \subset \Gamma_j$,
and combine them into a new set $X = [X^1, \ldots, X^n]$.
In X, the ratio of data belonging to $X^j$ is equal to $\pi_j$,
i.e., $|X^j| / |X| = \pi_j$, where $|\cdot|$ indicates the number of data
points. Then we can learn the parameters of the first-layer
classifiers as follows:

$$\theta_i = \operatorname*{optimize}_{\theta_i} f^i_{\text{learn}}(\Psi_i(X), Z_i; \theta_i) \qquad (24)$$
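For the non-additive case, the subsampling scheme can be sketched as below: draw from each dataset a number of points proportional to its importance factor, so that the mixed set has the required ratios. This is a minimal sketch; `sample_mixed_set`, the `total` size parameter, and the rounding rule are assumptions not given in the text.

```python
import random


def sample_mixed_set(datasets, pi, total, seed=0):
    """Draw about pi[j] * total points without replacement from datasets[j]."""
    rng = random.Random(seed)
    mixed = []
    for data, weight in zip(datasets, pi):
        # Cap at the dataset size so sampling without replacement stays valid.
        count = min(len(data), round(weight * total))
        mixed.extend(rng.sample(data, count))
    return mixed


# Toy example: two datasets of integer "data points" with disjoint ranges.
datasets = [list(range(100)), list(range(100, 150))]
pi = [0.25, 0.75]
mixed = sample_mixed_set(datasets, pi, total=40)
from_first = sum(1 for x in mixed if x < 100)
```

If a dataset is smaller than its requested share, the cap means the realised ratio will deviate from the importance factor; a smaller `total` avoids this.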

To re-learn the second-layer classifiers, the only change
made to Equation 6 is that, instead of using all the data while
optimizing for a particular task, we use only the data points
that have the ground-truth label for the corresponding task:

$$\omega_i = \operatorname*{optimize}_{\omega_i} {f'}^{i}_{\text{learn}}([\Psi_i(X), Z], Y_i; \omega_i), \quad \text{s.t.} \; X \in \Gamma_i \qquad (25)$$

Feed-back Step: In this step, we change Equation 7 as
follows. Since a datapoint in the set $\Gamma_j$ only has the
ground-truth label for the j-th task ($Y_j$), we only consider $J_2^j$ in
the second term. However, since this datapoint has outputs
for all the first-layer classifiers from the feed-forward step,
we consider all the $J_1^i$'s, i = 1, ..., n. Therefore, in order
to obtain the value of Z corresponding to each data point
$X \in \Gamma_j$, we have

$$\operatorname*{optimize}_{Z} \; \sum_{i=1}^{n} J_1^i(\Psi_i(X), Z_i; \theta_i) + J_2^j(\Psi_j(X), Z, Y_j; \omega_j) \qquad (26)$$
4.1 FE-CCM: Different Instantiations

The parameters $\pi_j$ allow us to formulate three different
instantiations of our model.

• Unified FE-CCM: In this instantiation, our goal is to
achieve improvements in all tasks with one set of parameters
$\{\Theta, \Omega\}$. We want to balance the data from different
datasets (i.e., with different task labels). Towards this
goal, $\pi_j$ is set to be inversely proportional to the amount
of data in the dataset of the j-th task. Therefore, the
unified FE-CCM balances the amount of data in different
datasets, based on Equation 23.

• One-goal FE-CCM: In this instantiation, we set $\pi_j = 1$
if j = k, and $\pi_j = 0$ otherwise. This is an extreme setting
to favor the specific task k. In this case, the retraining of
the first-layer classifiers will only use the feedback from
Classifier k on the second layer, i.e., only use the
dataset with labels for the k-th task. Therefore, FE-CCM
degrades to a model with only one target task (the k-th
task) on the second layer, and all the other tasks are only
instantiated on the first layer. Although the goal in this
setting is to completely benefit the k-th task, in practice
it often results in overfitting and does not always achieve
the best results even for the specific task (see Table 1 in
Section 6). In this case, we train different models, i.e.,
different $\theta_i$'s and $\omega_i$'s, for different target tasks.

• Target-Specific FE-CCM: This instantiation is to optimize
the performance of a specific task. As compared to
one-goal FE-CCM, where we manually remove the other
tasks on the second layer, in this instantiation we keep
all the tasks on the second layer and conduct data-driven
selection of the parameters $\pi_j$ for different datasets.
In detail, $\pi_j$ is selected through cross-validation on a
hold-out set in the learning process in order to optimize
the second-layer output of a specific task. Since
Target-Specific FE-CCM still has all the tasks instantiated on the
second layer, the re-training of the first-layer classifiers
can still use data from different datasets (i.e., with
different task labels). Here we train different models, i.e.,
different $\theta_i$'s and $\omega_i$'s, for different target tasks.
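The three choices of importance factors can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `validation_score` is an assumed callable that would train a model with the given factors and return the hold-out performance of the target task.

```python
def unified_weights(dataset_sizes):
    """Importance factors inversely proportional to dataset size, normalized to sum to 1."""
    inv = [1.0 / s for s in dataset_sizes]
    total = sum(inv)
    return [v / total for v in inv]


def one_goal_weights(n_tasks, k):
    """All importance on the dataset of the target task k."""
    return [1.0 if j == k else 0.0 for j in range(n_tasks)]


def target_specific_weights(candidates, validation_score):
    """Pick the candidate factors with the best hold-out score for the target task."""
    return max(candidates, key=validation_score)


# Toy usage with made-up dataset sizes and a made-up validation score.
w_unified = unified_weights([100, 300])
w_one_goal = one_goal_weights(3, 1)
w_target = target_specific_weights(
    [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]],
    validation_score=lambda pi: -abs(pi[0] - 0.6),
)
```

With the unified factors, each dataset contributes the same total weight (factor times size) to the objective in Equation 23, which is the balancing effect described above.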

5 Scene Understanding: Implementation

In this section we describe the implementation details of our
instantiation of FE-CCM for scene understanding. Each
of the classifiers described below for the sub-tasks is
our "base model" shown in Table 1. In some sub-tasks,
our base model is simpler than the state-of-the-art
models (which are often hand-tuned for the specific sub-task).
However, even when using these base models, our
FE-CCM outperforms the state-of-the-art models for the
respective sub-tasks (on the same standard datasets), as
shown in Section 6.

In order to explain the implementation details for the
different tasks, we use the following notation. Let
i be the index of the tasks we consider. We consider six
tasks in our experiments on scene understanding: scene
categorization (i = 1), depth estimation (i = 2), event
categorization (i = 3), saliency detection (i = 4), object
detection (i = 5) and geometric labeling (i = 6). The inputs
for the j-th task at the first layer are given by the low-level
features $\Psi_j$. At the second layer, in addition to the original
features $\Psi_j$, the inputs include the outputs from the
first-layer classifiers. This is given by

$$\Phi_j = [\Psi_j \; Z_1 \; Z_2 \; Z_3 \; Z_4 \; Z_5 \; Z_6] \qquad (27)$$

where $\Phi_j$ is the input feature vector for the j-th task on the
second layer, and $Z_i$ (i = 1, ..., 6) represents the output
from the i-th task, which is appended to the input to the j-th
task on the second layer.
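The construction of the second-layer input in Equation 27 is a plain concatenation, which can be sketched as below. The function name and the toy dimensions are assumptions for illustration only.

```python
def second_layer_input(psi_j, first_layer_outputs):
    """Append the outputs of all first-layer tasks to the original features of task j."""
    phi = list(psi_j)
    for z in first_layer_outputs:
        phi.extend(z)
    return phi


# Toy example: 3 original features and two first-layer outputs (sizes are made up).
psi = [0.1, 0.2, 0.3]
outputs = [[1.0, 0.0], [0.5]]
phi = second_layer_input(psi, outputs)
```

The length of the result is the number of original features plus the total size of the appended outputs, so it grows with the number of tasks.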

Scene Categorization. For scene categorization, we classify
an image into one of the 8 categories defined by
Torralba et al. [57]: tall building, inside city, street, highway,
coast, open country, mountain and forest. We evaluate the
performance by measuring the rate of incorrectly assigning
a scene label to an image on the MIT outdoor scene dataset
[57]. The feature input for the first-layer scene classifier,
$\Psi_1 \in \mathbb{R}^{512}$, is the GIST feature, extracted at 4 x 4
regions of the image, on 4 scales and 8 orientations.

We use an RBF-kernel SVM classifier [58] as the
first-layer scene classifier, and a multi-class logistic classifier
for the second layer. The output of the first-layer scene
classifier $Z_1 \in \mathbb{R}^8$ is an 8-dimensional vector where each
element represents the log-odds score of the corresponding
image belonging to a scene category. This 8-dimensional
output is fed to each of the second-layer classifiers.

Depth Estimation. For the single image depth estimation
task, we estimate the depth of every pixel in an image.
We evaluate the estimation performance by computing the
root mean square error of the estimated depth with respect
to ground-truth laser scan depth using the Make3D Range
Image dataset [59], [60]. We uniformly divide each image
into 55 x 305 patches as in [59]. The feature inputs for the
first-layer depth estimation $\Psi_2 \in \mathbb{R}^{104}$ are features which
capture texture, color and gradient properties of the patch.
These are obtained by convolving the image with Laws' masks
and computing the energy and kurtosis over the patch, along
with the shape features as described by Saxena et al. [59].

We use linear regression for the first-level and second-level
instantiations of the depth estimation module. The
output of the first-layer depth estimation $Z_2 \in \mathbb{R}_+$ is the
predicted depth of each patch in the image. In order to feed
the first-layer depth output to the second-layer classifiers,
for the scene categorization and event categorization tasks,
we use a vector with the predicted depth of all patches in
the image; for the other tasks, we use the 1-dimensional
predicted depth for the patch/pixel/bounding-box, etc.

Event Categorization. For event categorization, we classify
an image into one of the 8 sports events as defined by
Li et al. [46]: bocce, badminton, polo, rowing, snowboarding,
croquet, sailing and rock-climbing. For evaluation, we
compute the rate of incorrectly assigning an event label
to an image. The feature input for the first-layer event
classifier $\Psi_3 \in \mathbb{R}^{43}$ is a 43-dimensional feature vector,
which includes the top 30 PCA projections of the
512-dimensional GIST features [61], the 12-dimensional global
color features (mean and variance of RGB and YCrCb color
channels over the entire image), and a bias term.
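The dimension count 30 + 12 + 1 = 43 can be checked with a small sketch of how such a feature vector might be assembled. The ordering of the entries and the RGB-to-YCrCb conversion (the common ITU-R BT.601 full-range formula with a 128 offset) are assumptions; the source does not specify either, and the function names are hypothetical.

```python
def color_features(pixels):
    """pixels: list of (R, G, B) tuples in [0, 255].

    Returns 12 values: mean and variance of the R, G, B, Y, Cr and Cb channels.
    """
    channels = [[] for _ in range(6)]
    for r, g, b in pixels:
        y = 0.299 * r + 0.587 * g + 0.114 * b   # luma (assumed BT.601 weights)
        cr = (r - y) * 0.713 + 128.0            # red-difference chroma
        cb = (b - y) * 0.564 + 128.0            # blue-difference chroma
        for channel, value in zip(channels, (r, g, b, y, cr, cb)):
            channel.append(value)
    feats = []
    for channel in channels:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        feats.extend([mean, var])
    return feats


def event_feature_vector(gist_pca_30, pixels):
    """30 PCA projections of GIST + 12 color statistics + 1 bias term."""
    return list(gist_pca_30) + color_features(pixels) + [1.0]


# Toy usage: a uniform gray image and placeholder PCA projections.
gray = [(100, 100, 100)] * 4
vec = event_feature_vector([0.0] * 30, gray)
```

For a uniform gray image all variances are zero and both chroma means sit at the 128 offset, which is a quick sanity check on the conversion.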

We use a multi-class logistic classifier on each layer
for event classification. The output of the first-layer event
classifier $Z_3 \in \mathbb{R}^8$ is an 8-dimensional vector where each
element represents the log-odds score of the corresponding
image belonging to an event category. This 8-dimensional
output is fed to each of the second-layer classifiers.

Saliency Detection. The goal of the saliency detection task
is to classify each pixel in the image as either salient or
non-salient. We use the saliency detection dataset used by
Achanta et al. [62] for our experiments. The feature inputs
for the first-layer saliency classifier $\Psi_4 \in \mathbb{R}^4$ include the
3-dimensional color-offset features based on the Lab color
space, as described by Achanta et al. [62], and a bias term.

We use a logistic model for the saliency estimation
classifiers on both layers. The output of the first-layer
saliency classifier $Z_4 \in \mathbb{R}$ is the log-odds score of a
pixel being salient. In order to feed the first-layer saliency
detection output to the second-layer classifiers, for the scene
categorization and event categorization tasks, we form a
vector with the predicted saliency of all the pixels in the
image; for the other tasks, we use the 1-dimensional average
saliency for the corresponding pixel/patch/bounding-box.

Object Detection. We consider the following object
categories: car, person, horse and cow. We use the train set and
test set of PASCAL 2006 [63] for our experiments. Our
object detection module builds on the part-based detector
of Felzenszwalb et al. [64]. We first generate 5 to 100
candidate windows for each image by applying the
part-based detector with a low threshold (over-detection). The
feature inputs for the first-layer object detection classifier
$\Psi_5 \in \mathbb{R}^K$ are the HOG features extracted from the
candidate window as in [65], plus the detection score from
the part-based detector [64]. K depends on the number of
scales considered and the size of the object template.

We learn an RBF-kernel SVM model as the first-layer
classifier. The classifier assigns each window a +1 or 0
label indicating whether the window belongs to the
object or not. For the second-layer classifier, we learn a
logistic model over the feature vector constituted by the
outputs of all first-level tasks and the original HOG feature.
We use average precision to quantitatively measure the
performance. The output of the first-layer object detection
classifier $Z_5 \in \mathbb{R}^4$ contains the estimated 0 or 1 labels for a
region belonging to each of the 4 object categories we consider. In
order to feed the first-layer object detection output to the
second-layer classifiers, we first generate a detection map
for each object. Pixels inside the estimated positive boxes
are labeled as "1"; otherwise they are labeled as "0". For
scene categorization and event categorization on the second
layer, we feed all the elements of the map; for the other
tasks, we use the 1-dimensional average value on the map
for the corresponding pixel/patch/bounding-box.
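The detection map and the per-region average described above can be sketched as follows. This is an illustrative sketch only: the box format (top, left, bottom, right with exclusive bottom/right) and the function names are assumptions.

```python
def detection_map(height, width, positive_boxes):
    """Binary map: 1 inside any positive box, 0 elsewhere.

    Boxes are (top, left, bottom, right) with bottom and right exclusive.
    """
    grid = [[0] * width for _ in range(height)]
    for top, left, bottom, right in positive_boxes:
        for r in range(max(0, top), min(height, bottom)):
            for c in range(max(0, left), min(width, right)):
                grid[r][c] = 1
    return grid


def region_average(grid, top, left, bottom, right):
    """Average map value over a region (a pixel, patch, or bounding box)."""
    values = [grid[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(values) / len(values)


# Toy example: one positive 2 x 2 box in a 4 x 4 image.
grid = detection_map(4, 4, [(0, 0, 2, 2)])
whole_image = region_average(grid, 0, 0, 4, 4)
inside_box = region_average(grid, 0, 0, 2, 2)
```

For image-level tasks such as scene or event categorization, the flattened `grid` itself would be the feature; for region-level tasks, `region_average` gives the 1-dimensional value.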

Geometric Labeling. The geometric labeling task refers
to assigning each pixel to one of three geometric classes:
support, vertical and sky, as defined by Hoiem et al. [33].
For evaluation, we compute the accuracy of assigning the
correct geometric label to a pixel. The feature inputs for
the first-layer geometry labeling classifier $\Psi_6 \in \mathbb{R}^{52}$ are the
region-based features as described by Hoiem et al. [33].

We use the dataset and the algorithm of [33] as the
first-layer geometric labeling module. To reduce the computation
time, we avoid the multiple segmentations and instead use
a single segmentation with 100 segments per image. We
use a logistic model as the second-layer classifier. The
output of the first-layer geometry classifier $Z_6 \in \mathbb{R}^3$ is
a 3-dimensional vector with each element representing the
log-odds score of the corresponding pixel belonging to a
geometric category. In order to feed the first-layer
geometry output to the second-layer classifiers, for scene/event
categorization we form a vector with the predicted scores
of all pixels; for the other tasks we use the 3-dimensional
vector with each element representing the average scores
for the corresponding pixel/patch/bounding-box.

6 Experiments and Results 

6.1 Experimental Setting 

The proposed FE-CCM model is a unified model which 
jointly optimizes for all the sub-tasks. We believe this is 
a powerful algorithm in that, while independent efforts 
towards each sub-task have led to state-of-the-art algorithms 
that require intricate modeling for that specific sub-task, the 
proposed approach is a unified model that can beat the
state-of-the-art performance in each sub-task and can be
seamlessly applied across different applications.

We evaluate our proposed method on combining the six tasks
introduced in Section 5. In our experiments, the training of
FE-CCM takes 4-5 iterations. For each of the sub-tasks
in each of the domains, we evaluate our performance on
the standard dataset for that sub-task (and compare against
the specifically designed state-of-the-art algorithm for that
dataset). Note that, with such disjoint yet practical datasets,
no image has ground truth available for more than
one task. Our model handles this well.

In our experiments, we evaluate the following algorithms, as
shown in Table 1:

• Base model: Our implementation (Section 5) of each
sub-task, which serves as a base model for our FE-CCM.
(The base model uses less information than
state-of-the-art algorithms for some sub-tasks.)

• All-features-direct: A classifier that takes all the fea- 
tures of all sub-tasks, appends them together, and 
builds a separate classifier for each sub-task. 

• State-of-the-art model: The state-of-the-art algorithm 
for each sub-task respectively on that specific dataset. 

• CCM: The cascaded classifier model by Heitz
et al. [11], which we re-implement for six sub-tasks.

• FE-CCM (unified): This is our proposed model. Note 
that this is one single model which maximizes the joint 
likelihood of all the sub-tasks. 

• FE-CCM (one goal): In this case, we have only one 
sub-task instantiated on the second layer, and the goal 
is to optimize the outputs of that sub-task. We train a 
specific one-goal FE-CCM for each sub-task. 



TABLE 1

Summary of results for the six vision tasks. Our method improves performance in every single task. (Note: Bold face corresponds to our
model performing equally with or better than state-of-the-art.)

Fig. 3. Results for the six tasks in scene understanding. Top: the performance for event categorization, scene categorization, saliency
detection, geometric labeling, and depth estimation. Bottom: the average performance for object detection and the performance for the
detection of individual object categories: car, person, horse, and cow. Each figure compares four methods: the all-features-direct method,
state-of-the-art methods, CCM, and the proposed FE-CCM method.



• FE-CCM (target specific): In this case, we train a
specific FE-CCM for each sub-task, by using
cross-validation to estimate the $\pi_j$'s in Equation 23. Different
values for the $\pi_j$'s result in different parameters learned
for each FE-CCM.

Note that both CCM and All-features-direct use information 
from all sub-tasks, and state-of-the-art models also use care- 
fully designed models that implicitly capture information 
from the other sub-tasks. 

6.2 Datasets 

The datasets used are mentioned in Section 5, and the
number of test images in each dataset is shown in Table 1.
For each dataset we use the same number of training
images as the state-of-the-art algorithm (for comparison).
We perform 6-fold cross-validation on the whole model
with 5 of 6 sub-tasks to evaluate the performance on each
task. We do not do cross-validation on object detection, as
the train/test split is standard on the PASCAL 2006 [63] dataset
(1277 train and 2686 test images, respectively).

6.3 Results 

To quantitatively evaluate our method for each of the
sub-tasks, we consider the metrics appropriate to each of the
six tasks in Section 5. Table 1 and Figure 3 show that
FE-CCM not only beats the state of the art in all the tasks but also
does so jointly as one single unified model.

In detail, we see that all-features-direct improves over
the base model because it uses features from all the tasks.
The state-of-the-art classifiers improve on the base model
by explicitly hand-designing the task-specific probabilistic
model [46], [59] or by using ad hoc methods to implicitly use
information from other tasks [33]. Our FE-CCM model,
which is a single model that was not given any manually
designed task-specific insight, achieves a more significant
improvement over the base model.

We also compare the three instantiations of FE-CCM
in Table 1 (the last three rows). We observe that the
target-specific FE-CCM achieves the best performance, by
selecting a set of $\pi_j$'s to optimize for each task
independently. Though the unified FE-CCM achieves slightly
worse performance, it jointly optimizes for all the tasks by
training only one set of parameters. The performance of
one-goal FE-CCM is less stable than that of the other two
instantiations, mainly because the first-layer classifiers
only gain feedback from the specific task on the second
layer in one-goal FE-CCM, which easily causes overfitting.

We note that our target-specific FE-CCM, which is
optimized for each task independently and achieves the best
performance, is a fairer comparison to the state of the
art because each state-of-the-art model is trained
specifically for the respective task. Furthermore, Figure 3 shows
the results for CCM (which is a cascade without feedback
information) and all-features-direct (which uses features
4. The state-of-the-art method for depth estimation in [59] follows a
slightly different testing procedure. In that case, our target-specific
FE-CCM method achieves RMSE = 15.3.
Fig. 4. Results showing improvement using the proposed model. From top to bottom: depth estimation, saliency detection, object
detection, geometric labeling. All depth maps in depth estimation are at the same scale (black means near and white means far). Salient
regions in saliency detection are indicated in cyan. Geometric labeling: Green=Support, Blue=Sky and Red=Vertical. (Best viewed in color.)
TABLE 2

Summary of results for combining scene categorization and object
detection, with partially-labeled datasets and fully-labeled datasets.

Fig. 5. Confusion matrices for (a) event categorization; (b) scene
categorization; (c) geometric labeling. All the results are obtained
with the proposed FE-CCM method. The average accuracy achieved
by the proposed FE-CCM model outperforms the state-of-the-art
methods for each of these tasks, as listed in Table 1.

from all the tasks). This indicates that the improvement is 
strictly due to the proposed feedback and not just because 
of having more information. 

We show some visual improvements due to the proposed
FE-CCM in Figure 4. In comparison to CCM, FE-CCM
leads to better depth estimation of the sky and the ground,
to better coverage and more accurate labeling of
the salient region in the image, and to better
geometric labeling and object detection. Figure 5 also
provides the confusion matrices for three tasks: scene
categorization, event categorization, and geometric labeling.

Figure 6 provides scatter plots of the per-image performance
difference between the unified FE-CCM method
and the all-features-direct method, respectively for the
tasks of geometric labeling, saliency detection, and depth 
estimation. We note that for all three tasks, the unified FE- 
CCM outperforms the all-features-direct method on most 
images. For geometric labeling and saliency detection, the 
improvement from the unified FE-CCM method is mainly 
due to large improvements on some images. For depth 
estimation, the improvement is scattered over many images. 

The cause of improvement. We have shown improvements
of FE-CCM in Table 1 under the situation of heterogeneous
datasets. The improvement can be caused by one or both
of the following reasons: (1) the feedback process finds
better error modes for the first-layer classifiers; (2) the
feedback generates additional "labels" to retrain the
first-layer classifiers. In order to analyze this, we consider the
two tasks of scene recognition and object detection on the
DS1 dataset in [11], which contains ground-truth labels for
both tasks. We compare the various methods under two
settings: (1) train with the fully-labeled data; (2) train with
only the scene labels for one half of the training data and
only the object labels for the second half. Table 2 compares
the performance of the different methods when trained with
partially-labeled and fully-labeled datasets.
The experiments are performed using 5-fold
cross-validation. The unified FE-CCM method outperforms the
other methods under both partially-labeled and fully-labeled
situations. We note that all methods listed perform better
when full labels are provided; FE-CCM, however, achieves
close performance in both settings. We also note that the
FE-CCM method trained with partially-labeled datasets
outperforms the CCM method trained with fully-labeled
datasets, which indicates that the improvement achieved by
the FE-CCM method is not simply from generating more
labels for training the first-layer classifiers, but also due to
finding useful modes for the first-layer classifiers.

Figure 7 illustrates the first-layer outputs for a test image,
at initialization and at the 5th iteration, respectively. Our
Fig. 6. Performance difference between the proposed unified FE-CCM method and the all-features-direct method for each test image,
respectively for the tasks of geometric labeling, saliency detection, and depth estimation, on one of the cross-validation folds.

Fig. 7. Illustration of the FE-CCM first-layer outputs for a single image. (a) The input image from the sports-event dataset. Its ground-truth
event label is "Bocce". (b-g) Outputs of the first-layer classifiers, at initialization (top row) and at the 5th iteration (bottom row). (h) Outputs
of the second-layer event classifier. Note that at initialization the first-layer classifiers are trained using ground-truth labels, i.e., the same as
CCM. In (b)(c)(d)(e)(h), Red=High-value, Blue=Low-value. In (f), Blue=Ground, Green=Vertical, Red=Sky. In (g), Red=Object Presence. (Best
viewed in color.)



initialization is the same as CCM, i.e., using
ground-truth labels to train the first-layer classifiers. We note that
with feedback, the first-layer output shifts to focus on
more meaningful modes. For example, at initialization, the event
classifier has widespread confusion with other categories.
With feedback, the event classifier is confused
only with the 'rock-climbing' and 'croquet' events, which
are more similar to 'bocce'. Moreover, the first-layer scene,
depth, and object classifiers also give more meaningful
predictions when trained with feedback. With better
first-layer predictions, our FE-CCM correctly classifies the event
as 'bocce', while CCM misclassifies it as 'rowing'.

6.4 Discussion 

FE-CCM allows each classifier in the second layer to learn
which information from the other first-layer sub-tasks is
useful, and this can be seen in the learned weights $\Omega$ for
the second layer. We provide a visualization of the weights
for the six vision tasks in Figure 8(a). We see that the
model agrees with our intuition that large weights are
assigned to the outputs of the same task from the first-layer
classifier (see the large weights assigned to the diagonals
in the categorization tasks), though saliency detection is an
exception, which depends more on its original features (not
shown here) and the geometric labeling output. We also
observe that the weights are sparse. This is an advantage of
our approach, since the algorithm automatically figures out
which outputs from the first-level classifiers are useful for
the second-level classifier to achieve the best performance.

Figure 8(b) provides a closer look at the positive weights
given to the various outputs for a second-level geometric
classifier. We observe that large positive weights are
assigned to "mountain", "forest", "tall building", etc. for
supporting the geometric class "vertical", and similarly
"coast", "sailing" and "depth" for supporting the "sky"



class. These illustrate some of the relationships the model
learns automatically without any intricate manual modeling.

Figure 8(c) visualizes the weights given to the depth
attributes (first-layer depth outputs) for the task of event
categorization. Figure 8(d) shows the same for the task
of scene categorization. We see that depth plays an
important role in these tasks. In Figure 8(c), we observe that
most event categories rely on the middle part of the image,
where the main objects of the event are often located.
For example, most of the "polo" images have horses and people
in the middle of the image, while many "snowboarding"
images have people jumping in the upper-middle part. For
scene categorization, most of the scene categories (e.g.,
coast, mountain, open country) have sky in the top part,
which is not as discriminative as the bottom part. In the scene
categories of tall buildings and street, the upper part of the
image consists of buildings, which discriminates these two
categories from the others. Not surprisingly, our method
has automatically figured this out (see Figure 8(d)).

Stability of the FE-CCM algorithm: In this paper, we
have presented results for six sub-tasks. In order to find
out how our method scales with different combinations and
numbers of sub-tasks, we have tried several combinations,
and in each case we get consistent improvement in each
sub-task. For example, in our preliminary experiments, we
combined depth estimation and scene categorization, and
the reductions in error are 12.0% and 13.2%, respectively.
Combining scene categorization and object detection gives
us 15.4% and 10.2% improvements, respectively (Table 2).
We then combined four tasks: event categorization, scene
categorization, depth estimation, and saliency detection, and
got improvements in all these sub-tasks [22]. Finally, we
also combined different tasks for robotic applications, and
the performance improvement was similar.
Fig. 8. (a) The absolute values of the weight vectors for the second-level classifiers, i.e., $\omega$. Each column shows the contribution of the various
tasks towards a certain task. (b) Detailed illustration of the positive values in the weight vector for a second-level geometric classifier. (c)(d)
Illustration of the importance of depths in different regions for predicting different events (c) and scenes (d). An example image for each class
is also shown above the map of the weights. (Note: Blue is low and Red is high. Best viewed in color.)

TABLE 3

Summary of results for the robotic grasping experiment. Our
method improves performance in every single task.

Fig. 9. Examples from the dataset used for the grasping robot
experiments. The two tasks considered were a six-class object
classification task and a grasping point detection task.

7 Robotic Applications 

In order to show the applicability of our FE-CCM to 
different scene understanding domains, we also used the 
proposed method in multiple robotic applications. 

7.1 Robotic Grasping 

Given an image and a depth map (Figure 9), the goal of the
learning algorithm in a grasping robot is to select a point
at which to grasp the object (this location is called the grasp point
[66]). It turns out that different categories of objects demand
different strategies for grasping. In prior work, Saxena et
al. [66], [67] did not use object category information for
grasping. In this work, we use our FE-CCM to combine
object classification and grasping point detection.

Implementation: We work with the labeled synthetic
dataset by Saxena et al. [66], which spans 6 object categories
and also includes an aligned pixel-level depth map for each
image, as shown in Figure 9. The six object categories
include spherically symmetric objects such as cereal bowl,
rectangular objects such as eraser, martini glass, books,
cups, and long objects such as pencil.

For grasp point detection, we compute image and 
depthmap features at each point in the image (using codes 
given by [66]). The features describe the response of the 
image and the depth map to a bank of filters (similar 
to Make3D) while also capturing information from the 
neighboring grid elements. We then use a regression over 
the features. The output is a confidence score for each point 
being a good grasping point. In an image, we pick the point 
with the highest score as the grasping point. 
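The final selection step amounts to an argmax over the per-point confidence scores, which can be sketched as below. The representation of the scores as a row-major grid and the function name are assumptions for illustration.

```python
def pick_grasp_point(score_map):
    """Return (row, col) of the highest confidence score in a 2-D grid of scores."""
    best = None
    for r, row in enumerate(score_map):
        for c, score in enumerate(row):
            if best is None or score > best[0]:
                best = (score, r, c)
    return best[1], best[2]


# Toy confidence scores for a 2 x 3 grid of candidate points.
scores = [
    [0.10, 0.40, 0.20],
    [0.05, 0.90, 0.30],
]
grasp_row, grasp_col = pick_grasp_point(scores)
```

Ties are broken in favour of the first point encountered in row-major order, which is an arbitrary choice in this sketch.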

For object detection, we use a logistic classifier to 
perform the classification. The output of the classifier is a 6- 
Fig. 10. Left: the grasping point detected by our algorithm. Right:
Our robot grasping an object using our algorithm.

dimensional vector representing the log-odds score for each 
category. The final classification is performed by assigning 
the image to the category with the highest score. 

Results: We evaluate our algorithm on a dataset published
in [66], and perform cross-validation to evaluate the
performance on each task. We use 6000 images for grasping point
detection (3000 for training and 3000 for testing) and 1200
images for object classification (600 for training and 600 for
testing). Table 3 shows the results for our algorithm's ability
to predict the grasping point, given an image and the depths
observed by the robot using its sensors. We see that our
FE-CCM obtains significantly better performance than
all-features-direct and CCM (our implementation). Figure 10
shows an example of our robot grasping an object.

7.2 Object-finding Robot 

Given an image, the goal of an object-finding robot is 
to find a desired object in a cluttered room. As discussed 
earlier, some types of scenes, such as a living room, 
are more likely to contain certain objects (e.g., shoes) than 
other types of scenes, such as a kitchen. Similarly, office scenes 
are more likely to contain tv-monitors than kitchen scenes. 
Furthermore, shoes are more likely to appear on a 
supporting surface such as the floor than on a vertical 
surface such as a wall. Therefore, in this 
work, we use our FE-CCM to combine object detection 
with indoor scene categorization and geometric labeling. 
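
As a rough sketch of how the tasks are combined, the second layer of the cascade takes the first-layer outputs as additional input. The code below assumes this is done by concatenating a task's own features with the first-layer outputs of all tasks; the function name `second_layer_input`, the fixed task ordering, and all numeric values are hypothetical and only illustrate the data flow.

```python
from typing import Dict, List


def second_layer_input(
    features: List[float],
    first_layer_outputs: Dict[str, List[float]],
) -> List[float]:
    """Concatenate a task's features with all first-layer outputs.

    Tasks are visited in sorted name order so that the layout of the
    augmented vector is deterministic (an assumption of this sketch).
    """
    augmented = list(features)
    for task in sorted(first_layer_outputs):
        augmented.extend(first_layer_outputs[task])
    return augmented


# Made-up features for one candidate window of the object detector.
object_features = [0.2, 0.7]

# Made-up first-layer outputs for the three tasks combined in this section.
first_layer = {
    "scene": [0.1, 0.6, 0.2, 0.1],   # bedroom, living room, kitchen, office
    "geometry": [0.8, 0.15, 0.05],   # ground, wall, ceiling
    "object": [0.4],                 # first-layer detector score
}
x2 = second_layer_input(object_features, first_layer)
```

With such an input, a second-layer shoe detector can, for instance, exploit a high living-room score or a high ground score near the candidate window, which is the kind of context described above.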

Fig. 11. Left: the shoe-finding robot, which has a camera to take 
photos of a scene. Right: the shoe detected using our algorithm. 

Implementation: For scene categorization, we use the 
indoor scene subsets in the Cal-Scene Dataset [68] and 
classify an image into one of four categories: bedroom, 
living room, kitchen and office. For geometric labeling, 
we use the Indoor Layout Data [69] and assign each 
pixel to one of three geometry classes: ground, wall and 
ceiling. We use the same features and classifiers for scene 
categorization as in Section 5. 

For object detection, we use the PASCAL 2007 Dataset 
[70] and our own shoe dataset to learn detectors for four 
object categories: shoe, dining table, tv-monitor, and sofa. 
We first use the part-based object detection algorithm in 
[38] to create candidate windows, and then use the same 
classifiers as described in Section 5. 

Results: We use this method to build a shoe-finding robot, 
as shown in Figure 11-left. With a limited number of 
training images (86 positive images in our case), it is hard 
to train a robust shoe detector that can find a shoe far away 
from the camera. However, using our FE-CCM model, the 
robot learns to leverage the other tasks and performs more 
robust shoe detection. Figure 11-right shows a successful 
detection. For more details and videos, please see fTTIl . 

8 Conclusions 

We propose a method for combining existing classifiers for 
different but related tasks in scene understanding. We treat 
each individual classifier as a 'black-box' (thus not 
needing to know the inner workings of the classifier) and 
propose learning techniques for combining them (thus not 
needing to know how to combine the tasks). Our method 
introduces feedback in the training process from the later 
stage to the earlier one, so that a later classifier can tell 
the earlier classifiers which error modes 
to focus on and which can be ignored without hurting the 
joint performance. 

Our extensive experiments show that our unified model (a 
single FE-CCM trained for all the sub-tasks) improves 
performance significantly across all the sub-tasks considered 
over the respective state-of-the-art classifiers. We show that 
this improvement is the result of our feedback process, and 
that the model automatically learns meaningful relationships 
between the tasks. We believe that this is a small step towards 
holistic scene understanding. 